The goal of this project is to predict fraudulent credit card transactions.
The data set consists of credit card transactions and includes, for each transaction, customer details, the merchant and category of purchase, and whether or not the transaction was fraudulent.
## Rows: 671,028
## Columns: 14
## $ trans_date_trans_time <dttm> 2019-02-22 07:32:58, 2019-02-16 15:07:20, 2019-…
## $ trans_year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2020, …
## $ category <chr> "entertainment", "kids_pets", "personal_care", "…
## $ amt <dbl> 7.79, 3.89, 8.43, 40.00, 54.04, 95.61, 64.95, 3.…
## $ city <chr> "Veedersburg", "Holloway", "Arnold", "Apison", "…
## $ state <chr> "IN", "OH", "MO", "TN", "CO", "GA", "MN", "AL", …
## $ lat <dbl> 40.1186, 40.0113, 38.4305, 35.0149, 39.4584, 32.…
## $ long <dbl> -87.2602, -80.9701, -90.3870, -85.0164, -106.385…
## $ city_pop <dbl> 4049, 128, 35439, 3730, 277, 1841, 136, 190178, …
## $ job <chr> "Development worker, community", "Child psychoth…
## $ dob <date> 1959-10-19, 1946-04-03, 1985-03-31, 1991-01-28,…
## $ merch_lat <dbl> 39.41679, 39.74585, 37.73078, 34.53277, 39.95244…
## $ merch_long <dbl> -87.52619, -81.52477, -91.36875, -84.10676, -106…
## $ is_fraud <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
We also add some variables isolating the hour of the day, day of the week, and month of each transaction, as well as the age of the customer, to assist in exploratory data analysis:
# use lubridate to isolate day of week, month, hour of day, age etc.
card_fraud <- card_fraud %>%
mutate( hour = hour(trans_date_trans_time),
wday = wday(trans_date_trans_time, label = TRUE),
month_name = month(trans_date_trans_time, label = TRUE),
age = interval(dob, trans_date_trans_time) / years(1)
) %>%
rename(year = trans_year) %>%
# use lat, long to calculate distance between transaction and cardholder's home
mutate(
# convert latitude/longitude to radians
lat1_radians = lat / 57.29577951,
lat2_radians = merch_lat / 57.29577951,
long1_radians = long / 57.29577951,
long2_radians = merch_long / 57.29577951,
# calculate distance in miles
distance_miles = 3963.0 * acos((sin(lat1_radians) * sin(lat2_radians)) + cos(lat1_radians) * cos(lat2_radians) * cos(long2_radians - long1_radians)),
# calculate distance in km
distance_km = 6377.830272 * acos((sin(lat1_radians) * sin(lat2_radians)) + cos(lat1_radians) * cos(lat2_radians) * cos(long2_radians - long1_radians))
)

Let’s explore the data set and understand some of its useful features.
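The distance columns above use the spherical law of cosines with Earth radii of 3963 miles and 6377.830272 km. As a quick sanity check, here is a standalone sketch using the first transaction's coordinates from the `glimpse()` output, confirming that the two radii agree up to the miles-to-km conversion:

```r
# great-circle distance via the spherical law of cosines, as in the pipeline above
deg2rad <- function(deg) deg / 57.29577951   # same degrees-to-radians constant as above

lat1  <- deg2rad(40.1186);  long1 <- deg2rad(-87.2602)   # cardholder (first row)
lat2  <- deg2rad(39.41679); long2 <- deg2rad(-87.52619)  # merchant (first row)

central_angle <- acos(sin(lat1) * sin(lat2) +
                        cos(lat1) * cos(lat2) * cos(long2 - long1))

distance_miles <- 3963.0 * central_angle
distance_km    <- 6377.830272 * central_angle

# the two radii encode the same distance, since 1 mile = 1.609344 km
stopifnot(abs(distance_km - distance_miles * 1.609344) < 1e-6)
```

For this first transaction the merchant sits on the order of 50 miles from the cardholder's home, which gives a feel for the scale of the new `distance_miles` column.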
# group transactions by year
card_fraud %>%
group_by(year) %>%
# count number of fraudulent and non- fraudulent transactions
count(is_fraud) %>%
# calculate variable for frequency of fraud
mutate(percentage = n/sum(n) * 100)

## # A tibble: 4 × 4
## # Groups: year [2]
## year is_fraud n percentage
## <dbl> <fct> <int> <dbl>
## 1 2019 1 2721 0.568
## 2 2019 0 475925 99.4
## 3 2020 1 1215 0.632
## 4 2020 0 191167 99.4
Approximately 0.6% of transactions were fraudulent in both 2019 and 2020.
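The counts in the table above can be used to check both the headline fraud rate and the row count reported by `glimpse()`:

```r
# totals read off the year-by-year table above
fraud_n <- 2721 + 1215            # fraudulent transactions, 2019 + 2020
legit_n <- 475925 + 191167        # legitimate transactions, 2019 + 2020
total_n <- fraud_n + legit_n

stopifnot(total_n == 671028)      # matches the row count from glimpse()

round(100 * fraud_n / total_n, 2) # overall fraud rate: about 0.59%
```

This extreme class imbalance, fewer than six fraudulent transactions per thousand, is worth keeping in mind for any later modelling.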
Next, we examine the date/time variables: when does fraud occur? Are there weekdays, months, or hours of the day when fraud occurs more frequently?
# plot bar graph to investigate fraud by day of the week
card_fraud %>%
# filter only for fraudulent transactions
filter(is_fraud==1) %>%
# count number of transactions of fraud by weekday
group_by(wday) %>%
count() %>%
# plot bars for weekdays
ggplot(aes(x = wday, y = n)) +
geom_bar(stat="identity") +
# add some text to show count of transactions for each day
geom_text(aes(label = n, y= n - 10),
colour = "white", size = 4 , vjust = 2) +
labs(x='', y='',
title = "Number of occurrences of fraud by weekday")

# plot bar graph to investigate fraud by month of the year
card_fraud %>%
# filter only for fraudulent transactions
filter(is_fraud==1) %>%
# count number of transactions of fraud by month
group_by(month_name) %>%
count() %>%
# plot bars for months
ggplot(aes(x = month_name, y = n)) +
geom_bar(stat="identity") +
# add some text to show count of transactions for each month
geom_text(aes(label = n, y= n - 10),
colour = "white", size = 4 , vjust = 2) +
labs(x='', y='',
title = "Number of occurrences of fraud by month")

# make a table to investigate which hours of the day are most affected
card_fraud %>%
# filter only for fraudulent transactions
filter(is_fraud==1) %>%
# count number of transactions of fraud by hour of the day
group_by(hour) %>%
count() %>%
# arrange from most fraud occurences to least
arrange(desc(n))

## # A tibble: 24 × 2
## # Groups: hour [24]
## hour n
## <int> <int>
## 1 23 1012
## 2 22 981
## 3 0 348
## 4 1 332
## 5 3 326
## 6 2 313
## 7 19 52
## 8 18 49
## 9 17 48
## 10 13 45
## # ℹ 14 more rows
Fraud occurs far more frequently during night-time hours, when people are asleep, and more frequently on weekend days, as these are generally times when people pay less attention to their bank accounts. Furthermore, fraud is far more common in the first six months of the year than in the last six; it is unclear why.
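The night-time claim can be quantified directly from the hourly counts printed above (a quick back-of-the-envelope check using those figures):

```r
# fraud counts for the six late-night hours, read off the hourly table above
night_counts <- c(1012, 981, 348, 332, 326, 313)  # hours 23, 22, 0, 1, 3, 2
total_fraud  <- 2721 + 1215                       # 2019 + 2020, from the earlier table

share_at_night <- sum(night_counts) / total_fraud
round(100 * share_at_night, 1)                    # about 84% of all fraud
```

So roughly five out of every six fraudulent transactions happen between 10 pm and 4 am.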
Next, we turn our attention to the numerical variables: what is the distribution of amounts for fraudulent and legitimate transactions?
# some quick summary stats using summarize
# group by fraud versus legitimate
card_fraud %>%
group_by(is_fraud) %>%
# calculate summary stats of amounts for each
summarize(mean_amount= mean(amt),
median_amount = median(amt),
minimum_amount = min(amt),
maximum_amount = max(amt))

## # A tibble: 2 × 5
## is_fraud mean_amount median_amount minimum_amount maximum_amount
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 527. 369. 1.06 1334.
## 2 0 67.6 47.2 1 27120.
# now plot histogram of fraudulent transaction amounts
card_fraud %>%
# only fraudulent transactions
filter(is_fraud==1) %>%
# plot distribution - make bins look as good as possible
ggplot(aes(x = amt)) +
geom_histogram(bins=50) +
labs(title = "Distribution of transaction amounts\nof fraudulent transactions",
     x = "Transaction amount / $", y = "")

We see that the distribution of fraudulent transaction amounts has a mean of about $530, and barely any amounts exceed $1,000. This makes sense: fraudsters want transactions to be worth the effort without attracting too much attention, since many tiny transactions or one very large transaction would alert the victim more quickly. In contrast, legitimate transactions have a much wider range and a far lower mean and median.
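One way to summarise the gap between the two distributions, using the summary table above, is the ratio of typical amounts:

```r
# ratios computed from the summary-statistics table above
mean_ratio   <- 527 / 67.6    # mean fraud amount vs mean legitimate amount
median_ratio <- 369 / 47.2    # median fraud amount vs median legitimate amount

round(c(mean = mean_ratio, median = median_ratio), 1)  # both about 7.8x
```

A typical fraudulent transaction is therefore nearly eight times larger than a typical legitimate one, by mean or by median.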
Now let’s consider the geospatial variables. We select the 400 cities (roughly half of the cities represented) with the most transactions and use the leaflet library to plot an interactive map of the latitude/longitude locations where legitimate and fraudulent transactions took place:
# arrange cities in order of number of transactions
top_fraud_cities <- card_fraud %>%
count(city) %>%
arrange(desc(n)) %>%
# slice 400 most represented cities
slice(1:400) %>%
select(city) %>%
pull()
# create colour vector
colours <- c('#e41a1c','#377eb8')
# create function that assigns colours to fraud/legit transactions
point_colour <- colorFactor(palette = colours,
card_fraud$is_fraud)
# filter only for top fraud cities
card_fraud %>%
filter(city %in% top_fraud_cities) %>%
# use leaflet and openstreetmaps for interactive map
leaflet() %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
# create circles at lat and long locations
addCircleMarkers(lng = ~long,
lat = ~lat,
radius = 1,
# colour fraud/legit transaction points
color = ~point_colour(is_fraud),
fillOpacity = 0.6,
label = ~is_fraud) %>%
# map legend
addLegend("bottomright", pal = point_colour,
values = ~is_fraud,
title = "Fraud")
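With 400 cities the map draws a very large number of individual markers and can get heavy in the browser. One option is leaflet's built-in marker clustering; the sketch below uses a tiny synthetic stand-in for card_fraud (the two rows, palette, and object names are made up for illustration) just to keep it self-contained:

```r
library(leaflet)

# small synthetic stand-in for card_fraud, purely for illustration
demo <- data.frame(lat = c(40.12, 39.42),
                   long = c(-87.26, -87.53),
                   is_fraud = factor(c(0, 1)))
pal <- colorFactor(c('#e41a1c', '#377eb8'), demo$is_fraud)

m <- leaflet(demo) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addCircleMarkers(lng = ~long, lat = ~lat,
                   radius = 1,
                   color = ~pal(is_fraud),
                   # aggregate nearby markers into clusters that expand on zoom
                   clusterOptions = markerClusterOptions())
```

Adding `clusterOptions = markerClusterOptions()` inside the `addCircleMarkers()` call of the original pipeline would work the same way.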